Entry Name: "IIITH-NAVYA-MC3"

VAST Challenge 2014
Mini-Challenge 3

 

 

Team Members:

Navya Yarrabelly, International Institute of Information and Technology - Hyderabad, yarrabelly.navya@students.iiit.ac.in PRIMARY
P Yashaswi, International Institute of Information and Technology - Hyderabad, p.yashaswi@students.iiit.ac.in
Veera Raghavendra Chikka, International Institute of Information and Technology - Hyderabad, raghavendra.ch@research.iiit.ac.in
Kamalakar Karlapalem (Advisor), International Institute of Information and Technology - Hyderabad, kamal@iiit.ac.in

Student Team: YES

 

Team Number: 37

 

Streaming User ID: yarrabelly.navya@students.iiit.ac.in

 

Analytic Tools Used:

QGIS, for plotting the geospatial data locations and paths
R, to analyze the graphs
D3.js
JavaScript
TwitInfo, adapted by the team for the challenge
Weka, for clustering and analyzing the clusters visually
Stanford NER, for named entity extraction
Tweet NLP, for POS tagging of microblog data

 

Approximately how many hours were spent working on this submission in total?

150 hours

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2014 is complete?
YES

 

 

Video:

http://youtu.be/5hpePxj6vP0

 

VAST-2014-MC3Video

 

 

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Questions

 

Please note - this challenge contains a question that is time-dependent. Within 3 hours of starting the final data stream, send an email to VASTChal2014MC3@vacommunity.org containing your answer to question MC3.1. Please include a copy of your answer to MC3.1 in your final answer form also. Your answers to MC3.2 and MC3.3, along with your video, are due July 8.

 

The responses to these questions should be incrementally built, as you (the contestant) acquire information from each streaming data segment you receive.  Your submission will answer these questions in consideration of all of the streaming data segments.

 

 

MC3.1 - Within 3 hours after the start of the final data stream, send an email to VASTChal2014MC3@vacommunity.org containing:

1.       An image showing the streaming data in your visual analytics tool. In this image, identify an event of interest that you intend to investigate further.

2.       The content of the final message in the data stream

 

a.      The identified event is "Fire at Dancing Dolphin". In Figure 1, clicking the event highlights all of its occurrences in the stream and shows the event-related information.

Figure 1: Image showing the streaming data along with the trends (bigrams), sized from large to small and coloured on a gradient from red to green in order of popularity.

b.      The content of the final message in the stream is: "RT @KronosStar There has been an explosion from inside the apartment building. Several people are down. #KronosStar #DancingDolphinFire #AFDHeroes".

 

 

 

 

MC 3.2 - Describe the timeline of up to five major events that you discover in the streaming data. This timeline should include information from all three segments of the data stream if needed.    Use specific microblog messages and call center data to support your description, but do not simply mimic back the data stream.  Provide a concise description of important participants, locations and durations.  Focus your response on the events themselves, rather than on the individuals reporting the events. Please limit your answer to no more than ten images and 1500 words.

 

1.      Event Detection:
To detect events from the given data, we used an adaptation of the 'bursty words' algorithm.

1.      From the given microblog messages, we identified a set 'S' of textual features: bigrams whose frequency over the given time period meets a minimum threshold 'T'.

2.      Each microblog message is treated as a transaction (item set) whose items are the identified textual features; non-textual features are ignored. From this set of transactions, we found another set 'D' of disjoint frequent item sets, i.e. features that co-occurred frequently in the microblog messages, with a threshold of 3*T/4.

3.      We clustered the set 'D' using the following similarity metric.
Let (k1, k2, k3) be a transaction T1 and (k2, k4) be a transaction T2, and let n(ki,kj) denote the number of microblog messages in which both features ki and kj occur.
Similarity(T1,T2) = [n(k1,k2) + n(k1,k4) + n(k2,k2) + n(k2,k4) + n(k3,k2) + n(k3,k4)] / [n(k1) + n(k2) + n(k3) + n(k4)]
At the end of the clustering process, each cluster represents one event. From these clusters we picked the five largest, each representing an event. Each point in a cluster corresponds to a microblog message that describes the event, and all data points in a cluster are taken as event-related transactions. Using the mapping from messages to transactions, all microblog messages whose corresponding transactions belong to an event cluster are taken as the event-related microblog data. Each event is then denoted by the most frequent bigram in its related microblog data.
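The feature extraction and similarity computation described above can be sketched as follows. This is a minimal illustration, not the code used for the submission: the function names, the whitespace tokenisation, and the reading of the similarity metric (co-occurrence counts summed over all cross pairs of the two transactions, divided by the summed message counts of the distinct features, with n(k,k) taken as n(k)) are assumptions.

```python
from collections import Counter
from itertools import product


def bigrams(text):
    """Return the set of adjacent word pairs in a lower-cased message."""
    words = text.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}


def frequent_bigrams(messages, threshold):
    """Set S: bigrams that appear in at least `threshold` messages."""
    counts = Counter(b for m in messages for b in bigrams(m))
    return {b for b, c in counts.items() if c >= threshold}


def cooccurrence(messages, features):
    """n[(ki, kj)]: number of messages containing both features.

    n[(k, k)] is the number of messages containing k.
    """
    n = Counter()
    for m in messages:
        present = bigrams(m) & features
        for a, b in product(present, repeat=2):
            n[(a, b)] += 1
    return n


def similarity(t1, t2, n):
    """Cross-pair co-occurrences divided by the summed single-feature counts."""
    numerator = sum(n[(a, b)] for a, b in product(t1, t2))
    denominator = sum(n[(k, k)] for k in set(t1) | set(t2))
    return numerator / denominator if denominator else 0.0
```

Given a list of message strings, `frequent_bigrams` yields the features, `cooccurrence` the pairwise counts, and `similarity` can then be passed to any clustering routine that accepts a custom pairwise metric.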

The five major events identified from the streaming data are:

    1. pok rally
    2. dancing dolphin
    3. black van
    4. shots fired
    5. suspects arrested

Figure 2: Image showing the identified events and their features. The size of each circle represents the popularity of the event.

2.      Event Timeline:
For each event 'E' we collected all the related microblog messages as described above. We divided the 270-minute time span into 10-minute intervals and plotted the number of event-related microblog messages per interval against time. From this graph, we identified the peaks using the TwitInfo peak detection algorithm. Each peak in the graph represents a sub-event, which is described by the set of most frequent unigrams and bigrams that cross a threshold 'T2'. The duration of the sub-event is given by the time intervals during which the event is at a peak. For each sub-event, we also selected the microblog message containing the largest number of the unigrams and bigrams that describe it. The complete set of microblog messages describing all the sub-events is taken as the evidence for the timeline of the event.
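The binning and peak detection step can be illustrated with the sketch below. It is a simplified variant in the spirit of TwitInfo's approach (a running mean and mean deviation, with a peak opening when a bin exceeds the mean by a multiple of the deviation); it is not TwitInfo's own code, and the parameter values `tau`, `alpha`, `warmup` and `min_dev` are assumptions chosen for illustration.

```python
def bin_counts(timestamps_min, total_minutes=270, width=10):
    """Count messages per fixed-width interval (timestamps in minutes from stream start)."""
    counts = [0] * (total_minutes // width)
    for t in timestamps_min:
        if 0 <= t < total_minutes:
            counts[int(t // width)] += 1
    return counts


def detect_peaks(counts, tau=2.0, alpha=0.125, warmup=3, min_dev=1.0):
    """Return inclusive (start_bin, end_bin) pairs for each detected peak."""
    if not counts:
        return []
    warmup = max(1, min(warmup, len(counts)))
    head = counts[:warmup]
    # Initialise the running statistics from the first few bins.
    mean = sum(head) / warmup
    meandev = sum(abs(c - mean) for c in head) / warmup
    peaks = []
    i = warmup
    while i < len(counts):
        # min_dev keeps the threshold meaningful when the history is flat.
        threshold = mean + tau * max(meandev, min_dev)
        if counts[i] > threshold:
            end = i
            # Extend the peak while bins stay above the pre-peak threshold.
            while end + 1 < len(counts) and counts[end + 1] > threshold:
                end += 1
            peaks.append((i, end))
            window = counts[i:end + 1]
            i = end + 1
        else:
            window = [counts[i]]
            i += 1
        # Exponentially weighted update of mean deviation and mean.
        for c in window:
            meandev = (1 - alpha) * meandev + alpha * abs(c - mean)
            mean = (1 - alpha) * mean + alpha * c
    return peaks
```

Each returned pair maps back to a time range by multiplying the bin indices by the interval width, which gives the duration of the corresponding sub-event.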

Figure 3: Image showing the timeline of the events, with peaks marked on the graph of the number of event-related messages per time interval against time.

Figure 4: Image showing the timeline of all events, with supporting descriptions from microblog data and call center data.

3.      Event duration:
For each event, we sort all its sub-events by their start times. Let 'SE' denote the sorted list of sub-events. The start time of the event is the time of the first microblog message in the time interval of the first sub-event in 'SE', and the end time of the event is the time of the last microblog message in the time interval of the last sub-event in 'SE'.
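As a small worked sketch of this rule, assuming sub-events are given as (start, end) intervals in minutes and message times as minutes from the start of the stream (both representations are illustrative assumptions, as is the function name):

```python
def event_duration(sub_events, message_times):
    """Return (start_time, end_time) of an event.

    sub_events: list of (interval_start, interval_end) in minutes, end exclusive.
    message_times: timestamps (minutes) of the event-related messages.
    """
    ordered = sorted(sub_events)  # the list 'SE', sorted by start time
    first, last = ordered[0], ordered[-1]
    in_first = [t for t in message_times if first[0] <= t < first[1]]
    in_last = [t for t in message_times if last[0] <= t < last[1]]
    if not in_first or not in_last:
        raise ValueError("a sub-event interval contains no messages")
    # First message of the first sub-event, last message of the last sub-event.
    return min(in_first), max(in_last)
```

Messages that fall outside every sub-event interval do not affect the result, so isolated early or late mentions do not stretch the event.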

Figure 5: Image showing the duration of the events.

4.      Event Location:
The call center messages sent during the peak time intervals of an event are considered the related call center data of that event. From this data, we manually filtered out the messages that are not related to the event (e.g. 'TRAFFIC STOP'). The locations were then plotted using QGIS. We indexed these locations by 10-minute time intervals and kept only those locations that are within vicinity range of other locations in the preceding and succeeding time intervals. For static events (e.g. "pok rally") we took the single location with the highest frequency in the final filtered list of locations. For dynamic events (e.g. "black van") we took all the locations in the final filtered list.

Figure 6: Image showing the locations of the events. The colour of each location follows a gradient from red to green in the order of the time at which the event took place. Alternatively, the label of the event is prefixed by a serial number that shows the same order. Hovering over a location gives its name. No location was identified for the event "suspects arrested".
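The vicinity filter can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually run in QGIS: locations are taken to be (latitude, longitude) pairs, "vicinity range" is modelled as a great-circle distance threshold `radius_km` (a hypothetical value), and a location is kept if it lies within that radius of at least one location in an adjacent interval.

```python
import math
from collections import Counter


def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def filter_by_vicinity(by_interval, radius_km=0.5):
    """Keep points that have a neighbour within radius_km in an adjacent interval.

    by_interval: dict mapping interval index -> list of (lat, lon) points.
    """
    kept = {}
    for idx, points in by_interval.items():
        neighbours = by_interval.get(idx - 1, []) + by_interval.get(idx + 1, [])
        kept[idx] = [p for p in points
                     if any(haversine_km(p, q) <= radius_km for q in neighbours)]
    return kept


def static_location(kept):
    """Most frequent location across all intervals (for static events)."""
    counts = Counter(p for pts in kept.values() for p in pts)
    return counts.most_common(1)[0][0] if counts else None


def dynamic_path(kept):
    """All kept locations in time order (for dynamic events)."""
    return [p for idx in sorted(kept) for p in kept[idx]]
```

Requiring a neighbour in both the preceding and the succeeding interval would be a stricter reading of the same rule and only changes the condition inside `filter_by_vicinity`.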

5.      Event Participants:
We used the CMU Twitter NLP part-of-speech tagging tool to tag the microblog data for each event. From the tagged data we extracted the longest runs of consecutive tokens tagged 'NNP', which form the candidate set of participants. From this set we removed the event locations identified above and selected the participants through manual inspection. In addition, we used the Stanford NER tool to extract named entities and added the phrases identified as PERSON and ORGANIZATION to the list of participants.
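The extraction of candidate participants from tagged text can be sketched as follows. The input is assumed to be already tagged as lists of (word, tag) pairs; no tagger is invoked here, and the function names and input format are hypothetical. The final manual inspection step is not modelled.

```python
def proper_noun_runs(tagged_tokens):
    """Return maximal runs of consecutive 'NNP' tokens, each joined into a phrase."""
    runs, current = [], []
    for word, tag in tagged_tokens:
        if tag == "NNP":
            current.append(word)
        else:
            if current:
                runs.append(" ".join(current))
            current = []
    if current:  # flush a run that ends the message
        runs.append(" ".join(current))
    return runs


def candidate_participants(tagged_messages, event_locations):
    """Collect distinct proper-noun phrases, dropping known event locations."""
    locations = {loc.lower() for loc in event_locations}
    candidates = []
    for tagged in tagged_messages:
        for phrase in proper_noun_runs(tagged):
            if phrase.lower() not in locations and phrase not in candidates:
                candidates.append(phrase)
    return candidates
```

Taking maximal runs rather than single tokens keeps multi-word names such as "Abila Fire Department" together as one candidate.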

Event               Participants
pok rally           POK leader Sylvia Marek, Dr. Audrey McConnell Newman (award-winning activist), Lucio Jakab, Viktor-E, Abila police, POK community
dancing dolphin     Abila Fire Department, Abila Police Department
black van           Black van guys, gunmen, hostages
shots fired         Police officer, Abila Police SWAT team
suspects arrested   APD

Table 1: Table showing the events and the participants involved.

Figure 7: Interface to select an event and view its timeline with supporting microblog and call center data, along with the locations and participants of the event. The above image is a snapshot of the interface for the event "black van".

 

 

 

 

 

MC 3.3 - Select one of your five major events from question MC 3.2 that you consider to be most likely to provide additional clues to the investigation of the GASTech disappearances.   Describe the roles of the participants.  Describe how other events you identified in MC3.2 may have influenced your selected event. Provide a hypothesis and evidence as to whom you suspect as being directly involved in the GAStech disappearances, either as perpetrators or victims.  Please limit your response to no more than five images and 500 words.

 

As per our hypothesis, POK is NOT involved in the GAStech disappearances.
Of the five events identified, we pick "black van" as the crucial event most likely to provide further evidence about the GAStech disappearances.
Roles of the participants:
The role of the gunmen, with their hostages in the black van, is to create terror among the people. The hostages could also be members of POK or people who were present at the rally.
Influence of other events:
From Figure 7, the black van started at the location of the Dancing Dolphin and ended at a location very close to that of the POK rally. That its route connects the locations of the two other major events, "pok rally" and "dancing dolphin", is suspicious. A fire at the Dancing Dolphin during the POK rally cannot be a coincidence. It could have served to distract attention from the POK rally and to disperse the police to the other end of the city, which is evident from Scene 2 of Figure 7 and from Figure 8. According to the timeline of the events, the black van began moving after the evacuations had started at the Dancing Dolphin, which allowed it to move freely. The accident at Schaber Ave could also have been intentional, so that the van would be pursued by the police while creating terror among the people of Abila. The location of the event "shots fired" is very close to that of the POK rally, and the shooting caused riots at a rally that had been peaceful until then. As POK members would have no motive to cause riots at their own rally, the black van guys are not related to POK. The suspicious black van group could therefore have kidnapped POK members while setting the fire at the Dancing Dolphin, which would serve the aim of causing terror in the city. By the same reasoning, the black van could also be viewed as a perpetrator behind the GAStech disappearances, as it fits the scenario and has the motive. Evaluating the genuineness of the event "suspects arrested" would provide further clues.

Figure 7: Image illustrating the complete scenario of the events.

Figure 8: Image showing that the other events contributed to the distraction from the POK rally.